[SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult.#13638
[SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult.#13638ueshin wants to merge 5 commits intoapache:masterfrom
Conversation
…not using canonicalized plan to compare can cacheTable.
|
Test build #60389 has finished for PR 13638 at commit
|
|
this looks reasonable to me, cc @marmbrus is there any corner cases? |
| val canonicalized = plan.canonicalized | ||
| cachedData.foreach { | ||
| case data if data.plan.collect { case p if p.sameResult(plan) => p }.nonEmpty => | ||
| case data if data.plan.collect { case p if p.sameResult(canonicalized) => p }.nonEmpty => |
There was a problem hiding this comment.
I think this is redundant, sameResult already compares the canonicalized plan.
There was a problem hiding this comment.
I don't think so.
For example, if the cached plan is LocalRelation (which is canonicalized) and the plan argument is SubqueryAlias(LocalRelation) (which is not canonicalized), it will fail to find the same-result plan.
There was a problem hiding this comment.
the implementation of sameResult:
def sameResult(plan: PlanType): Boolean = {
val left = this.canonicalized
val right = plan.canonicalized
......
Seems we do canonicalize plans?
There was a problem hiding this comment.
Some plans like LocalRelation, LogicalRDD, etc. override the sameResult and they don't use canonicalized plan to compare.
For example, LocalRelation overrides like as follows:
override def sameResult(plan: LogicalPlan): Boolean = plan match {
case LocalRelation(otherOutput, otherData) =>
otherOutput.map(_.dataType) == output.map(_.dataType) && otherData == data
case _ => false
}
If the plan is SubqueryAlias(LocalRelation), the method returns false.
There was a problem hiding this comment.
Should we just fix those implementations?
There was a problem hiding this comment.
Hmm, yes, we should for now, but we can't force to use canonicalized plan and it is difficult to check all implementations so I thought CacheManager should canonicalize plans.
There was a problem hiding this comment.
I guess I would prefer to fix in in sameResult if possible so that all uses of that function will benefit.
There was a problem hiding this comment.
I see, I'll fix them, too.
|
Seems reasonable. Is this a regression from 1.6? |
| tableName: Option[String] = None, | ||
| storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock { | ||
| val planToCache = query.queryExecution.analyzed | ||
| val planToCache = query.queryExecution.analyzed.canonicalized |
There was a problem hiding this comment.
do we still need these changes? LogicalPlan.canonicalized is a lazy val and we don't have performance penalty even we use un-canonicalized plan as key.
There was a problem hiding this comment.
It seems we don't need them for now.
I'll revert the changes.
|
Updated pr title, description and a test name. |
|
Test build #60451 has finished for PR 13638 at commit
|
|
@marmbrus It seems this is not a regression from |
|
Test build #60458 has finished for PR 13638 at commit
|
|
Test build #60459 has finished for PR 13638 at commit
|
|
LGTM, @marmbrus should we backport it to 1.6? |
|
Yeah, sounds reasonable. Merging to master, 2.0 and 1.6. |
…verride sameResult.
## What changes were proposed in this pull request?
`DataFrame` with plan overriding `sameResult` but not using canonicalized plan to compare can't cacheTable.
The example is like:
```
val localRelation = Seq(1, 2, 3).toDF()
localRelation.createOrReplaceTempView("localRelation")
spark.catalog.cacheTable("localRelation")
assert(
localRelation.queryExecution.withCachedData.collect {
case i: InMemoryRelation => i
}.size == 1)
```
and this will fail as:
```
ArrayBuffer() had size 0 instead of expected size 1
```
The reason is that when do `spark.catalog.cacheTable("localRelation")`, `CacheManager` tries to cache for the plan wrapped by `SubqueryAlias` but when planning for the DataFrame `localRelation`, `CacheManager` tries to find cached table for the not-wrapped plan because the plan for DataFrame `localRelation` is not wrapped.
Some plans like `LocalRelation`, `LogicalRDD`, etc. override `sameResult` method, but not use canonicalized plan to compare so the `CacheManager` can't detect the plans are the same.
This pr modifies them to use canonicalized plan when override `sameResult` method.
## How was this patch tested?
Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #13638 from ueshin/issues/SPARK-15915.
(cherry picked from commit c5b7355)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
Hmmm, does not apply cleanly to 1.6. @ueshin if you have time it might be nice to backport. |
|
@marmbrus Thank you for merging this. |
…n when override sameResult. ## What changes were proposed in this pull request? This pr is a backport of #13638 for `branch-1.6`. ## How was this patch tested? Added the same test as #13638 modified for `branch-1.6`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13668 from ueshin/issues/SPARK-15915_1.6.
…n when override sameResult. ## What changes were proposed in this pull request? This pr is a backport of apache#13638 for `branch-1.6`. ## How was this patch tested? Added the same test as apache#13638 modified for `branch-1.6`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes apache#13668 from ueshin/issues/SPARK-15915_1.6. (cherry picked from commit cffc080)
What changes were proposed in this pull request?
DataFramewith plan overridingsameResultbut not using canonicalized plan to compare can't cacheTable.The example is like:
and this will fail as:
The reason is that when do
spark.catalog.cacheTable("localRelation"),CacheManagertries to cache for the plan wrapped bySubqueryAliasbut when planning for the DataFramelocalRelation,CacheManagertries to find cached table for the not-wrapped plan because the plan for DataFramelocalRelationis not wrapped.Some plans like
LocalRelation,LogicalRDD, etc. overridesameResultmethod, but not use canonicalized plan to compare so theCacheManagercan't detect the plans are the same.This pr modifies them to use canonicalized plan when override
sameResultmethod.How was this patch tested?
Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable.